Abstract

Compressed sensing is a recent set of mathematical results showing that sparse signals can be exactly
reconstructed from a small number of linear measurements. Interestingly, for ideal sparse signals with
no measurement noise, random measurements allow perfect reconstruction while measurements based on
principal component analysis (PCA) or independent component analysis (ICA) do not. At the same time,
for other signal and noise distributions, PCA and ICA can significantly outperform random projections in
terms of enabling reconstruction from a small number of measurements. In this paper we ask: given the
distribution of signals we wish to measure, what is the optimal set of linear projections for compressed
sensing? We consider the problem of finding a small number of linear projections that are maximally
informative about the signal. Formally, we use the InfoMax criterion and seek to maximize the mutual
information between the signal, $x$, and the (possibly noisy) projection $y = Wx$. We show that in
general the optimal projections are neither the principal components of the data nor random projections, but
rather a seemingly novel set of projections that capture what is still uncertain about the signal, given the
knowledge of its distribution. We present analytic solutions for certain special cases including natural images.
In particular, for natural images, the near-optimal projections are bandwise random, i.e., incoherent to
the sparse bases at a particular frequency band but with more weight on the low frequencies, which has
a physical relation to the multi-resolution representation of images.

Index Terms 

Compressed sensing, InfoMax principle, uncertain component analysis, sensor capacity, informative 
sensing. 
I. Introduction

Compressed sensing [1], [2] is a set of recent mathematical results on a classic question: given a signal
$x \in \mathbb{R}^d$ and a set of $p$ linear measurements $y = Wx \in \mathbb{R}^p$, how many measurements are required to
enable reconstruction of $x$? Obviously, if we knew nothing at all about $x$, i.e., $x$ can be any $d$-dimensional
vector, we would need $d$ measurements. Alternatively, if we know our signal $x$ lies in a low-dimensional
linear subspace, say of dimension $k$, then $k$ measurements are sufficient. But what if we know that $x$
lies in a low-dimensional nonlinear manifold? Can we still get away with fewer than $d$ measurements?

To motivate this question, consider the space of natural images. An image with $d$ pixels can be thought
of as a vector in $\mathbb{R}^d$, but natural images occupy a tiny fraction of the set of all signals in this space. If
there were a way to exploit this fact, we could build cameras with a small number of sensors that would
still enable perfect, high-resolution reconstruction of natural images.

The basic mathematical results in compressed sensing deal with signals that are $k$-sparse. These are
signals that can be represented with a small number, $k$, of active (non-zero) basis elements. For such
signals, it was shown in [1], [2] that $ck \log d$ generic linear measurements are sufficient to recover the
signal exactly (with $c$ a constant). Furthermore, the recovery can be done by a simple convex optimization
or by a greedy optimization procedure [3].

These results have generated a tremendous amount of excitement in both the theoretical and practical
communities. On the theoretical side, the performance of compressed sensing with random projections
has been analyzed when the signals are not exactly $k$-sparse, but rather compressible (i.e., can be well
approximated with a small number of active basis elements) [1], [2], as well as when the measurements
are contaminated with noise [4]-[6]. On the practical side, applications of compressed sensing have been
explored in building "single-pixel" cameras [7], medical imaging [8], [9] and geophysical data analysis
[10].

Perhaps the most surprising result in compressed sensing is that perfect recovery is possible with 
random projections. This is surprising given the large amount of literature in machine learning and 
statistics devoted to finding projections that are optimal in some sense (e.g. [11]). In fact, as we see in 
this paper, for white sparse signals, random measurements significantly outperform measurements based 
on principal component analysis (PCA) or independent component analysis (ICA). At the same time, for 
other signal and noise distributions, PCA and ICA can significantly outperform random projections. 

In this paper we ask: given a distribution or statistics of the signals we wish to measure, what is the
optimal set of linear projections for compressed sensing? We show that the optimal projections are in
general neither the principal components nor the independent components of the data, but rather a seemingly
novel set of projections that capture what is still uncertain about the signal, given the signal distribution.
We present analytic solutions for various special cases, including natural images, and demonstrate, by
experiments, that the projections onto the uncertain components may far outperform random projections.

II. Informative Sensing 

In [12], Linsker suggested the InfoMax principle for the design of a linear sensory system. According 
to this principle, the goal of the sensory system is to maximize the mutual information between the 
sensors and the world (see also [13]-[15]). In this paper, we are interested in undercomplete InfoMax 
where the number of sensors is less than the dimensionality of the input. Given that dimensionality 
reduction throws away some information about the input, does it still make sense to maximize mutual 
information in the undercomplete case? 

Consider an example, shown in Fig. 1, where a 1-dimensional measurement is taken of a 2-dimensional
signal distributed as a mixture of four Gaussians. The black line in each panel denotes a projection vector,
and the red curve shows the Bayes least-squares (BLS) estimate of the signal given the projection. When a
random projection (left) or a single PCA projection (middle) is used, the BLS estimate is quite far from
the original data, indicating that a lot of information has been lost. But the InfoMax projection (shown
on the right) provides much more information, making the BLS decoding significantly better.




Fig. 1. Different kinds of one-dimensional projection schemes and their reconstruction results for a two-dimensional Gaussian
mixture source. (a) Random, (b) PCA, (c) InfoMax projections. Blue points: data samples. Black line: projection vector. Red
curve: reconstruction based on the BLS estimate, $\hat{x} = E[x \mid y]$.

The key to this result lies in the nonlinearity of the decoding scheme. It is well known that projections
based on PCA give optimal reconstruction, in terms of mean squared error (MSE), if the decoding
is restricted to be linear. But, if the decoding is allowed to be nonlinear, the optimal projection may
significantly differ from the PCA projection. In this regard, compressed sensing of sparse signals [1], [2]
is a spectacular demonstration of nonlinear decoding from a small number of linear projections.

Let $x$ and $y$ be the sensor input (original signal) and sensor output (measurements), related by
$y = Wx + \eta$, where $W$ is a $p \times d$ matrix ($p < d$) and $\eta$ represents the sensor noise, assumed to be Gaussian,
$G(0, \sigma^2 I)$. Our problem may be formally defined as

$$W^* = \arg\max_W I(x; Wx + \eta). \tag{1}$$

Without a constraint on $W$, the mutual information can be made arbitrarily large simply by scaling $W$.
To preclude such a trivial manipulation, we restrict $W$ to satisfy the orthonormality condition, i.e.,
$WW^T = I$.¹
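The effect of scaling $W$ can be illustrated numerically. The following is a minimal sketch, assuming a Gaussian source (so that the mutual information has the closed form $\frac{1}{2}\ln\det(I + W\Sigma_x W^T/\sigma^2)$); the covariance `cov_x`, the dimensions, and the helper names are hypothetical choices made only for illustration.

```python
import numpy as np

def gaussian_mutual_information(W, cov_x, noise_var):
    """I(x; Wx + eta) in nats, assuming x ~ N(0, cov_x) and eta ~ N(0, noise_var * I)."""
    p = W.shape[0]
    signal_cov = W @ cov_x @ W.T
    _, logdet = np.linalg.slogdet(np.eye(p) + signal_cov / noise_var)
    return 0.5 * logdet

def random_orthonormal_rows(p, d, rng):
    """Return a p x d matrix with orthonormal rows, i.e. W @ W.T = I."""
    q, _ = np.linalg.qr(rng.standard_normal((d, p)))  # q has orthonormal columns
    return q.T

rng = np.random.default_rng(0)
d, p, noise_var = 8, 3, 0.1
cov_x = np.diag(1.0 / np.arange(1, d + 1))  # hypothetical source covariance
W = random_orthonormal_rows(p, d, rng)

mi_unit = gaussian_mutual_information(W, cov_x, noise_var)
mi_scaled = gaussian_mutual_information(10.0 * W, cov_x, noise_var)
print(mi_unit, mi_scaled)  # the scaled matrix yields the larger value
```

Under these assumptions, multiplying $W$ by a constant inflates the mutual information without changing the directions being measured, which is why some normalization of $W$ is needed before the maximization is meaningful.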

Note that (1) looks quite similar to the definition of channel capacity. Considering the sensing process
as a channel, we may call $I(x; Wx + \eta)$ the sensor capacity, which measures how informative the sensor is.
However, as noted by Linsker, there is a crucial difference between the channel capacity problem and
the InfoMax problem: in (1), the source signal has a given probability distribution and the optimal
channel $W$ is sought, whereas in the channel capacity problem the channel is given and the input
distribution is optimized.

A simple alternative to (1) is to use $h(y)$, which characterizes the information content of the
measurements [6], instead of $I(x; y)$, because $I(x; y) = h(y) - h(y \mid x)$ and $h(y \mid x)$ is merely the entropy of the noise
$\eta$, which is invariant to $W$. Besides its simplicity, this objective function has another desirable property:
it is well defined even without noise, unlike $I(x; Wx)$, which diverges to infinity. In Appendix A
we will further discuss the validity of $h(Wx)$ with $WW^T = I$ as an objective function for the noiseless
condition.

The values of $h(Wx)$ for the sensing schemes in Fig. 1 are numerically evaluated and compared
in Table I. Indeed, the scheme with the highest entropy corresponds to the best reconstruction. Although
this need not always be the case, InfoMax and reconstruction are closely related. This can be seen
by considering the cost function suggested in [16]: $L(W) = \prod_{n=1}^{N} \Pr(x_n \mid y_n; W)$, where $x_n$ are samples
drawn from $\Pr(x)$ and $y_n = Wx_n$. It is easy to see that $\ln L(W) \propto -h(x \mid y)$, so that maximizing $h(y)$
is equivalent to maximizing the probability of correct reconstruction of a sample given its projection.

¹ In certain applications, it makes more sense to limit the total available power and replace the orthonormality constraint with
$\mathrm{tr}(WW^T) \le P$ for a fixed budget $P$. For the noiseless case, the two constraints can be shown to result in the same optimal
set of projections.
TABLE I

Performance Comparison among Three Projection Schemes in Example of Fig. 1 in terms of Entropy
$h(Wx)$ and Mean Squared Error (MSE).

            (a) Random    (b) PCA    (c) InfoMax
$h(y)$      1.189         1.345      1.449
MSE         0.931         0.726      0.476


Minimizing uncertainty has also been proposed recently in the context of sequential design of compressed
sensing by [6], [17]. In this context, the projections are chosen sequentially, where each projection
minimizes the remaining uncertainty about the signal given the results of the previous projections.

III. Analysis 

The optimal projections maximizing $h(y)$ vary according to the prior distribution of the source signal
$x$. The multivariate Gaussian is a special kind of signal whose optimal projection can be found analytically,
but, in most cases, the optimal projection is hard to find in closed form because of the complicated
nature of differential entropy.

For a $p$-dimensional random vector $y$ whose covariance is $\Sigma_y$, its entropy $h(y)$ can be decomposed
into

$$h(y) = h(\tilde{y}) + \tfrac{1}{2} \ln\det(\Sigma_y) \tag{2}$$

where $\tilde{y}$ is a whitened version of $y$, e.g., $\tilde{y} = \Sigma_y^{-1/2} y$. Note that $h(\tilde{y})$ is covariance-free, depending only
on the shape of the probability distribution of $y$. It is well known that $h(\tilde{y})$ is maximized to $\frac{p}{2}\ln(2\pi e)$ if
and only if $y$ is jointly Gaussian. On the other hand, the second term, $\frac{1}{2}\ln\det(\Sigma_y)$, depends only on $\Sigma_y$,
the covariance of $y$. So, we will call $h(\tilde{y})$ the shape term and $\frac{1}{2}\ln\det(\Sigma_y)$ the variance term. Overall,
an entropy is the sum of the shape term and the variance term.
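As a small worked instance of this decomposition (not taken from the paper, and using only the standard closed-form entropy of a scalar Laplacian with scale $b$), the shape term can be computed by hand and compared with the Gaussian maximum:

```latex
% Scalar Laplacian y with scale b: p(y) = (1/(2b)) exp(-|y|/b), Var(y) = 2 b^2.
\begin{align*}
h(y) &= 1 + \ln(2b), \\
\tfrac{1}{2}\ln\det(\Sigma_y) &= \tfrac{1}{2}\ln(2b^2) = \tfrac{1}{2}\ln 2 + \ln b, \\
h(\tilde{y}) &= h(y) - \tfrac{1}{2}\ln(2b^2) = 1 + \tfrac{1}{2}\ln 2 \approx 1.347 \ \text{nats}, \\
\tfrac{1}{2}\ln(2\pi e) &\approx 1.419 \ \text{nats (Gaussian shape term, } p = 1).
\end{align*}
```

The scale $b$ cancels in the shape term, as expected for a covariance-free quantity, and the Laplacian value falls below the Gaussian one.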

In the following, we present some analytical results for undercomplete InfoMax, based on the entropy
decomposition, and compare two popular projection schemes (PCA and random
projections) in various special cases, which gives us a better understanding of the desirable behavior
of informative sensing.
A. White Signals

Observation 1: For white data, the projected signal must be as Gaussian as possible to be most 
informative. 

Proof: For white data, the variance term in (2) vanishes and only the shape term remains. Therefore,
InfoMax is achieved by the projection that maximizes the shape term, which is largest when the projected signal is Gaussian. ■

Observation 2: For white data $x$ with zero mean and finite variances, if the distribution of $\|x\|/\sqrt{d}$
degenerates to a constant, then $p$ random projections, for $p < O(\sqrt{d})$, are asymptotically optimal as $d$ goes
to infinity.

Proof: In [18], Dasgupta et al. have shown that almost all $p$ linear projections behave like a scale
mixture of zero-mean Gaussians whose variances have the same profile as the distribution of
$\|x\|/\sqrt{d}$. If the distribution of $\|x\|/\sqrt{d}$ approaches a Dirac delta function, the mixture collapses to a single Gaussian.
Specifically, in this case, the main theorem of [18] reads approximately as follows: for any ball
$B \subset \mathbb{R}^p$ and for almost all $W$,

$$\sup \left| \Pr(B; W) - \Pr(B) \right| \le O\big(p^2/d\big)^{1/4} \tag{3}$$

where sup represents the supremum, and where $\Pr(\cdot\,; W)$ and $\Pr(\cdot)$ denote the probability with respect
to the probability density function (pdf) of $Wx$ and the probability with respect to the Gaussian pdf, i.e.,
$G(0, I)$. If $p < O(\sqrt{d})$, the error bound goes to zero as $d \to \infty$. By Observation 1, therefore, $p$
random projections are asymptotically optimal for such white data. ■

Example 1: Consider a specific type of white data $x$ that satisfies the source separation generative
model $x = Vs$, where the $s_k$ are iid with zero mean and unit variance and where $V \in \mathbb{R}^{d \times d}$ is orthonormal.
Because $\|x\|/\sqrt{d} = \|s\|/\sqrt{d} \to \sqrt{\mathrm{Var}(s_k)} = 1$ as $d \to \infty$, Observation 2 applies, suggesting that
$p$ random projections, for any fixed $p$, are maximally informative as the input dimension goes to infinity.

Example 2 ("compressed sensing of sparse signals"): Consider $x = Vs$ as in Example 1, but with
$s$ being $k$-sparse, where the nonzero elements are iid. If $k$ is $O(\log d)$ and $p = ck \log d$, then $p$ is
$O(\log^2 d)$, which grows more slowly than $\sqrt{d}$, so random projections are optimal because $p < O(\sqrt{d})$.
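The Gaussianization of random projections in these examples can be checked empirically. The sketch below is only an illustration under stated assumptions: it takes $V = I$ (so $x = s$), uses unit-variance Laplacian components, and uses excess kurtosis as a crude proxy for non-Gaussianity rather than the entropy itself; the sizes `d` and `n` are arbitrary.

```python
import numpy as np

def excess_kurtosis(samples):
    """Sample excess kurtosis (0 for a Gaussian, 3 for a Laplacian)."""
    z = (samples - samples.mean()) / samples.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
d, n = 256, 20000

# Unit-variance Laplacian sources: Var = 2 * scale**2, so scale = 1/sqrt(2).
s = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(n, d))

# One random unit-norm projection versus one axis-aligned (PCA/ICA-like) projection.
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
y_random = s @ w
y_axis = s[:, 0]

kurt_random = excess_kurtosis(y_random)
kurt_axis = excess_kurtosis(y_axis)
print(kurt_random, kurt_axis)
```

With these settings the random projection is expected to have excess kurtosis close to zero, while the axis-aligned projection keeps the heavy tails of the source.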

While random projections are asymptotically Gaussian for white data, we are also interested in
evaluating their entropy for large but finite-dimensional data. We now develop an explicit approximation for
the expected value of the entropy of $p$ random projections in $d$ dimensions, where both $p$ and $d$ are finite.
Specifically, let us consider $x = Vs$, as in Example 1, where $s_k$ follows a generalized Gaussian
(GG) distribution. A random variable $x$ is said to be $GG(x; \alpha, \mu, \sigma^2)$ if its pdf is given by

$$p(x) = \frac{\alpha}{2\beta\,\Gamma(1/\alpha)} \exp\left( -\left| \frac{x - \mu}{\beta} \right|^{\alpha} \right) \tag{4}$$

for $\sigma, \alpha > 0$ and $\beta = \sigma \sqrt{\Gamma(1/\alpha)/\Gamma(3/\alpha)}$, where $\mu$ is the mean of the distribution, $\sigma$ is the standard deviation, and $\alpha$
is known as the shape parameter. We will simply denote the distribution by $GG(\alpha)$ wherever $\mu$ and $\sigma$ are
not specified. The GG includes a number of well-known pdfs as special cases: $GG(1)$ is a Laplacian
distribution, $GG(2)$ corresponds to a Gaussian distribution, whereas in the limiting case
$\alpha \to 0$, a degenerate distribution at $x = \mu$ is obtained. In general, if $\alpha < 2$, the distribution is
sparse, with the degree of sparsity determined by $\alpha$ (the smaller, the sparser). The shape term of
$GG(\alpha)$ is computed to be

$$c_\alpha = \frac{1}{2} \ln\left( \frac{4\,\Gamma^3(1/\alpha)}{\alpha^2\,\Gamma(3/\alpha)} \right) + \frac{1}{\alpha} \tag{5}$$

in nats, as drawn in Fig. 2.
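The shape term is easy to evaluate numerically. The following sketch assumes the closed form $c_\alpha = \frac{1}{2}\ln\big(4\Gamma^3(1/\alpha) / (\alpha^2 \Gamma(3/\alpha))\big) + 1/\alpha$ for the unit-variance entropy of $GG(\alpha)$ and works with log-gamma values for numerical stability; the function name is a hypothetical choice.

```python
import math

def gg_shape_term(alpha):
    """Unit-variance differential entropy (nats) of a generalized Gaussian GG(alpha)."""
    log_ratio = (math.log(4.0) + 3.0 * math.lgamma(1.0 / alpha)
                 - 2.0 * math.log(alpha) - math.lgamma(3.0 / alpha))
    return 0.5 * log_ratio + 1.0 / alpha

c_gaussian = gg_shape_term(2.0)   # should equal 0.5 * ln(2*pi*e)
c_laplacian = gg_shape_term(1.0)  # should equal 1 + 0.5 * ln(2)
c_sparse = gg_shape_term(0.5)     # sparser source, smaller shape term

for alpha in (0.3, 0.5, 1.0, 2.0):
    print(alpha, gg_shape_term(alpha))
```

The two special cases serve as sanity checks: $\alpha = 2$ recovers the Gaussian value and $\alpha = 1$ recovers the Laplacian value, and the shape term shrinks as the distribution becomes sparser.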
Fig. 2. Unit-variance entropy (shape term) of $GG(\alpha)$ for various values of $\alpha$.



The net increment of entropy (in expectation) obtained by adding a new $k$th random projection to
$(k-1)$ preselected random projections $\{y_i\}_{i=1}^{k-1}$ is represented by a conditional entropy $E[h(y_k \mid y_1, \ldots, y_{k-1})]$,
which we will call the individual capacity of the random projection and denote by $\nu(k)$.² It is easy to see
that $\nu(k)$ monotonically decreases with $k$ because, for any $i < j$, $\nu(j) = E[h(y_j \mid y_1, \ldots, y_{i-1}, \ldots, y_{j-1})]$
$\le E[h(y_j \mid y_1, \ldots, y_{i-1})]$, which is equal to $\nu(i)$ due to the symmetry among the $y_j$'s.

² Note that this quantity is relative (i.e., meaningful only in a comparative sense) because it is a kind of differential entropy.
Although an individual capacity may take a negative value, adding the projection to the preselected set of
projections is always better than not adding it.

For the white data $x = Vs$,
all the random projections remain uncorrelated with each other as long as they satisfy the orthonormality
constraint $WW^T = I$. However, uncorrelatedness does not imply independence, as the following makes
manifest: from Observation 2 or by the central limit theorem, the first random projection will
be maximally informative, so

$$h(y_1) \ge h(s_k), \quad \forall k. \tag{6}$$

At the other extreme, any nondegenerate set of $d$ projections should have the same capacity because
they are bijective and perfectly describe the original signal $x$, so

$$\sum_k \nu(k) = \sum_k h(s_k) = h(x), \tag{7}$$

which may be understood as total capacity preservation. To satisfy (6) and (7) at the same time, the
individual capacity of random projections must decrease. Since we consider white data whose
projections are uncorrelated with each other, we expect the dependency among the projections to be
small and propose to ignore high-order multi-information terms.

Observation 3: For white data $x = Vs$, where $V \in \mathbb{R}^{d \times d}$ is orthonormal and $s$ has the GG distribution
given by (13), the expected value of the entropy of $p$ random projections is

$$E[h(y_1, \ldots, y_p)] \approx p\,c_2 - \frac{p(p-1)}{d-1}\,(c_2 - c_\alpha) \tag{8}$$

for large $d$, where $c_\alpha$ denotes the shape term of a $GG(\alpha)$ random variable and $c_2$ is its value
at $\alpha = 2$ (i.e., the shape term of a Gaussian random variable), if we ignore multi-information
terms of higher order than pairwise dependency.

Proof: The joint entropy of $k$ random projections $y_1, \ldots, y_k$ can be approximated by

$$h(y_1, \ldots, y_k) \approx \sum_i h(y_i) - \sum_{i<j} I(y_i; y_j) \tag{9}$$

if we neglect higher-order multi-information terms. Again, due to the symmetry among the $y_j$'s, we may write

$$E[h(y_i)] = h_c, \ \forall i, \qquad E[I(y_i; y_j)] = I_c, \ \forall i \ne j. \tag{10}$$

Inserting (10) into (9), we obtain

$$E[h(y_1, \ldots, y_k)] \approx k\,h_c - \frac{k(k-1)}{2}\,I_c. \tag{11}$$

If $d$ is sufficiently large, $h_c$ approaches $c_2$ by the central limit theorem, and $I_c$ is determined to be
$\frac{2}{d-1}(c_2 - c_\alpha)$ by the total capacity preservation rule (setting $k = d$ in (11) and equating the result to
$h(x) = d\,c_\alpha$). Altogether, we have (8), which deviates from the Gaussian value by
$\frac{p(p-1)}{d-1}(c_2 - c_\alpha)$. Interestingly, such an order $O(p^2/d)$ appears also in [18] (e.g., see (3)). ■

Observation 3 also tells us that $\nu(k)$ decreases approximately linearly, under the assumption that
we may ignore the high-order multi-information terms, because

$$\nu(k) = E[h(y_1, \ldots, y_k)] - E[h(y_1, \ldots, y_{k-1})] = c_2 - \frac{2(k-1)}{d-1}\,(c_2 - c_\alpha). \tag{12}$$
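The linear capacity profile and the total capacity preservation rule can be checked with a few lines of code. This is a minimal sketch assuming the pairwise approximation, i.e. $E[h(y_1,\ldots,y_p)] \approx p\,c_2 - \frac{p(p-1)}{d-1}(c_2 - c_\alpha)$ and $\nu(k) = c_2 - \frac{2(k-1)}{d-1}(c_2 - c_\alpha)$; the function names and the values of `d` and `alpha` are illustrative only.

```python
import math

def gg_shape_term(alpha):
    """Unit-variance entropy (nats) of GG(alpha)."""
    log_ratio = (math.log(4.0) + 3.0 * math.lgamma(1.0 / alpha)
                 - 2.0 * math.log(alpha) - math.lgamma(3.0 / alpha))
    return 0.5 * log_ratio + 1.0 / alpha

def expected_entropy_random(p, d, alpha):
    """Approximate expected joint entropy of p random projections of white GG data."""
    c2, ca = gg_shape_term(2.0), gg_shape_term(alpha)
    return p * c2 - p * (p - 1) / (d - 1) * (c2 - ca)

def individual_capacity(k, d, alpha):
    """Approximate entropy increment contributed by the k-th random projection."""
    c2, ca = gg_shape_term(2.0), gg_shape_term(alpha)
    return c2 - 2.0 * (k - 1) / (d - 1) * (c2 - ca)

d, alpha = 1024, 0.5
capacities = [individual_capacity(k, d, alpha) for k in range(1, d + 1)]
total_capacity = sum(capacities)
source_entropy = d * gg_shape_term(alpha)  # h(x) for d iid unit-variance GG(alpha) components
print(total_capacity, source_entropy)
```

Summing the individual capacities over all $d$ projections recovers $d\,c_\alpha$, and the partial sums agree with the joint-entropy approximation, so the two formulas are consistent with each other.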

Indeed, the property of random projections that makes $\nu(k)$ tilted is attractive for informative sensing
because it concentrates large capacity on the first $p$ projections while keeping the remaining uncertainty
(i.e., $\sum_{k=p+1}^{d} \nu(k)$) small. The ideal projections would concentrate all
capacity on the first few projections (i.e., with capacity falling like a mirrored step function). In reality,
no such ideal projections may exist, since they would have to be exactly Gaussian while remaining
nondegenerate; yet the linear concentration property keeps the random projections close to Gaussian.
In the asymptotic case ($d \to \infty$), the linear concentration property gives $\nu(k) \approx c_2$ for any fixed $k$, making
$p$ random projections, for any fixed $p$, look Gaussian and thus be apparently optimal.
For the white data $x = Vs$, where $p(s)$ is given by

$$p(s) = \prod_{k=1}^{d} GG(s_k; \alpha, 0, 1), \tag{13}$$

for some value of $\alpha$, $h(y_1, \ldots, y_p)$ of the random projection and the PCA projection can be computed as follows.

1) Random:

$$E[h(y_1, \ldots, y_p)] \approx p\,c_2 - \frac{p(p-1)}{d-1}\,(c_2 - c_\alpha). \tag{14}$$

2) PCA:

$$h(y_1, \ldots, y_p) = p\,c_\alpha. \tag{15}$$

For all $\alpha < 2$, random projection performs better than the PCA (or ICA) projection, as shown in Fig. 3,
and the performance gap is amplified when $\alpha$ is small (i.e., the distribution is sparse).

B. Non-white Signals 

Observation 4: For Gaussian data, the PCA projection is optimal, i.e., maximally informative. 

Proof: If the PCA projection happens to be Gaussian, both individual terms of the entropy are
maximized: the shape term by the Gaussianity and the variance term by the PCA property. Thus, the
overall entropy is also maximized. This situation happens for a Gaussian source. Therefore, this provides
a simple proof that the optimal projection for Gaussian data must be the first $p$ principal components. ■

Fig. 3. Relative performance of random projection over PCA projection for a white GG source whose shape parameter is $\alpha$.
The original dimension $d$ of the source is $2^{16}$ (= 65,536).

For instance, consider a $d$-dimensional Gaussian distribution whose variance falls off as $1/k^{\gamma}$ along
each axis $k = 1, \ldots, d$, for a positive value of $\gamma$. We compute $h(y_1, \ldots, y_p)$ of the random projection and
the PCA projection, using the entropy decomposition of (2):

1) Random:

$$E[h(y_1, \ldots, y_p)] = p\,c_2 + \frac{1}{2} \ln E\left[ \mathrm{Vol}_p\left(1, \frac{1}{2^{\gamma}}, \ldots, \frac{1}{d^{\gamma}}\right) \right] \tag{16}$$

where $\mathrm{Vol}_p(\lambda_1, \ldots, \lambda_d)$ denotes the volume of a $p$-dimensional hyperrectangle whose edge lengths are
randomly (but without repetition) chosen from $\lambda_1, \ldots, \lambda_d$ (see Appendix B for how to compute $\mathrm{Vol}_p(\cdot)$).

2) PCA:

$$h(y_1, \ldots, y_p) = p\,c_2 - \frac{\gamma}{2} \sum_{k=1}^{p} \ln k. \tag{17}$$

For all $\gamma > 0$, the PCA projection performs better than random projection, as shown in Fig. 4, and greater
$\gamma$ results in a larger performance gap.

Now consider a hybrid (i.e., sparse but non-white) type of the preceding two examples so that

$$p(x) = \prod_{k=1}^{d} GG\left(x_k; \alpha, 0, \frac{1}{k^{\gamma}}\right). \tag{18}$$
We can simply approximate the relative performance of random projection over the PCA projection by
summing two graphs, one each from Figs. 3 and 4. As illustrated in Fig. 5, neither one consistently
outperforms the other over the whole range of $p$ (see the cases of $\alpha = 0.3$, $\gamma = 2$ in Fig. 5(a) and $\alpha = 0.5$, $\gamma = 1$
in Fig. 5(b)). However, if $\alpha = 0.5$ and $\gamma = 2$, for example, the PCA projection always does better than
random projection.
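The sum of the two effects can be sketched numerically as follows. This is an illustration under explicit assumptions, not the paper's Appendix B procedure: the shape-term gap is taken as $p(c_2 - c_\alpha) - \frac{p(p-1)}{d-1}(c_2 - c_\alpha)$, and the expected volume is assumed to be the average, over all $p$-subsets of the variances $\lambda_k = k^{-\gamma}$, of the product of the chosen variances (an elementary symmetric polynomial divided by a binomial coefficient, accumulated in the log domain). All function names are hypothetical.

```python
import math
import numpy as np

def gg_shape_term(alpha):
    """Unit-variance entropy (nats) of GG(alpha)."""
    log_ratio = (math.log(4.0) + 3.0 * math.lgamma(1.0 / alpha)
                 - 2.0 * math.log(alpha) - math.lgamma(3.0 / alpha))
    return 0.5 * log_ratio + 1.0 / alpha

def log_mean_subset_product(lam, p):
    """ln of the average, over all p-subsets of lam, of the product of the chosen entries."""
    log_e = np.full(p + 1, -np.inf)  # log elementary symmetric polynomials e_0..e_p
    log_e[0] = 0.0
    for log_l in np.log(lam):
        # e_j <- e_j + lam_k * e_{j-1}; the right-hand side uses the old values.
        log_e[1:] = np.logaddexp(log_e[1:], log_l + log_e[:-1])
    d = len(lam)
    log_binom = math.lgamma(d + 1) - math.lgamma(p + 1) - math.lgamma(d - p + 1)
    return float(log_e[p] - log_binom)

def entropy_gap_random_minus_pca(p, d, alpha, gamma):
    """Approximate h(random) - h(PCA) for a GG(alpha) source with variances k**(-gamma)."""
    lam = np.arange(1, d + 1, dtype=float) ** (-gamma)
    delta = gg_shape_term(2.0) - gg_shape_term(alpha)
    shape_gap = p * delta - p * (p - 1) / (d - 1) * delta
    variance_gap = 0.5 * log_mean_subset_product(lam, p) - 0.5 * float(np.sum(np.log(lam[:p])))
    return shape_gap + variance_gap

for alpha, gamma in [(0.3, 2.0), (0.5, 1.0), (0.5, 2.0)]:
    print(alpha, gamma, [round(entropy_gap_random_minus_pca(p, 256, alpha, gamma), 3)
                         for p in (1, 8, 32)])
```

A positive value favors random projection and a negative value favors PCA. The numbers depend on the assumed volume formula and on the small example dimension used here, so they should be read qualitatively rather than as a reproduction of Fig. 5.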

C. Noisy Measurement 

Observation 5: If the noise variance $\sigma^2$ is large, the PCA projections are optimal.

Proof: This was proven in [16] and makes sense because the major principal components are the most
robust to noise. Another sketch of the proof can be given as follows. The noisy measurement process
$y = Wx + \eta$ can be rewritten as

$$y = W(x + \eta_x) \tag{19}$$

where $\eta_x \sim G(0, \sigma^2 I)$, for an orthonormal matrix $W$; this is valid because $W\eta_x$ has covariance
$\sigma^2 WW^T = \sigma^2 I$, the same as $\eta$. Then, we can pretend that $x' = x + \eta_x$ is the
original source under a noiseless measurement process, for the simple purpose of maximizing $h(y)$.
If $\sigma$ is large, $\eta_x$ dominates $x$ in $x' = x + \eta_x$, making the overall distribution Gaussian. Then, from
Observation 4, the PCA projections are optimal. ■
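The step of moving the noise to the input side relies on $W\eta_x$ having the same covariance as the original sensor noise. A minimal numerical check, with arbitrary illustrative dimensions and noise level, is:

```python
import numpy as np

rng = np.random.default_rng(1)
d, p, sigma = 16, 4, 0.7

# Random p x d matrix with orthonormal rows (W @ W.T = I).
q, _ = np.linalg.qr(rng.standard_normal((d, p)))
W = q.T

# Covariance of the input-side noise eta_x ~ G(0, sigma^2 I_d) after projection by W.
cov_eta_x = sigma ** 2 * np.eye(d)
cov_projected_noise = W @ cov_eta_x @ W.T

print(np.round(cov_projected_noise, 6))  # should match sigma^2 * I_p
```

Because the projected noise is zero-mean Gaussian with covariance $\sigma^2 I$, it is statistically indistinguishable from sensor noise added after the projection, provided the rows of $W$ are orthonormal.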

Indeed, the rate of convergence (to Gaussian) with increasing $\sigma$ is fast. For scalar random variables
$x \sim GG(\alpha, 0, \lambda)$ and $\eta_x \sim G(0, \sigma^2)$, the shape term $c'_\alpha$ of $x' = x + \eta_x$ can be computed by (see
where $\tilde{x} \sim GG(\alpha, 0, 1)$ and $\tilde{\eta} \sim G(0, 1)$. Therefore, $c'_\alpha$ is a function of $\alpha$ and the signal-to-noise
ratio (SNR) $\lambda/\sigma^2$. Unfortunately, we could not find further simplification for (20), but we can still evaluate
it numerically. Fig. 6 illustrates how $c'_\alpha$ changes with respect to the SNR for some values of $\alpha$. As
shown in the figure, it grows quite rapidly to $c_2$ (i.e., Gaussian), even with a relatively small amount
of noise. In practical applications, measurement noise is often unavoidable and can significantly affect
the informativeness of each set of projections.

To summarize, neither random projections nor PCA and ICA are in general the best projections for
informative sensing. We showed that for white signals random projection is near-optimal, while for
Gaussian signals PCA is optimal. For power-law sparse signals, PCA is better than random unless the
signals are extremely sparse. Even for extremely sparse power-law signals, PCA outperforms random
with a small amount of noise. Beyond these results, this section also motivates the need for a new type
of projection that is universally optimal for informative sensing. Since we actually seek to maximize
$h(y)$, the uncertainty of a set of linear projections of the data, we call the optimization scheme uncertain
component analysis (UCA).
IV. Informative Sensing for Multi-Resolution Signal Models

As we saw in Fig. 5, in general neither random projections nor PCA projections are optimal for
non-white signals. In this section, we will consider a practically important class of signals that can be
represented by multi-resolution signal models [19]

where $v_k$ and $\lambda_\ell$ denote the independent components of $x$ and the variance of $s_k = v_k^T x$. This
representation is based on the source separation generative model $x = Vs$, where $V$ is an orthonormal mixing
matrix whose columns consist of the independent components, i.e., $V = [v_1 \ldots v_d]$. In this model, the
independent components are grouped into $(L + 1)$ different bands by their resolution, and the independent
components of the same band (e.g., $B_\ell$) are assumed to be iid, having the same pdf $\psi$ with the same
variance $\lambda_\ell$. We will assume, without loss of generality, that $\lambda_0 > \lambda_1 > \lambda_2 > \cdots > \lambda_L$. Often, the
variance gap between two adjacent bands is known to be quite large, i.e., $\lambda_\ell \gg \lambda_{\ell+1}$, and $\psi$ is modeled
by $GG(\alpha)$ for some $\alpha$.

Natural images are a good example of such signals [20]. The independent components $\{v_k\}$ for a
natural image are a set of Gabor-like filters in multi-level resolutions [11], with the variance of $v_k^T x$
satisfying

$$\mathrm{Var}(v_k^T x) = \lambda_\ell \propto \frac{1}{4^{\ell}} \tag{22}$$

remarkably well, according to the power spectral statistics [21]. The heavy-tailedness of the distribution
of $v_k^T x$ is also well known and has been widely modeled by $v_k^T x \sim GG(\alpha)$ with $\alpha < 1$ in the literature
[22]-[24].

Fig. 6. Variation of the shape term of $GG(\alpha)$ by adding Gaussian noise. Note that the horizontal axis indicates the reciprocal
of the SNR, not the SNR itself.



A. Derivation of UCA Projections 

Despite the simplicity of the signal model in (21), it seems infeasible to derive the true optimal
projection that maximizes $h(y)$. Here we refrain from mixing the $s_k$'s across bands and seek
a suboptimal solution among the bandwise projections, in which $W$ is of the form

$$W = \begin{bmatrix} W_0 & 0 & \cdots & 0 \\ 0 & W_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & W_L \end{bmatrix}. \tag{23}$$

In (23), each submatrix $W_\ell$ is of size $p_\ell \times |B_\ell|$, with $p_\ell$ variable while satisfying $\sum_{\ell=0}^{L} p_\ell = p$. If a matrix $W'$
that is not itself bandwise can be made bandwise by a rotation, i.e., by premultiplying by a unitary matrix $U$,
it is equivalent to the bandwise matrix $UW'$ because $h(UW'x) = h(W'x)$. In other words, a bandwise
matrix $W$, with a particular set of $p_\ell$'s, represents many (although not all) matrices $W'$ whose
power profile is the same as the distribution of $\{p_\ell\}$, i.e., $\sum_{i=1}^{p} \sum_{j \in B_\ell} W'^{\,2}_{ij} = p_\ell$. Thus, our bandwise
restriction in fact includes more projections than explicitly shown in (23).

With this restriction on the projection matrix structure, we have the following observation:
Observation 6: For the signal model in (21), if we consider only the bandwise projections in the form
of (23) with the $p_\ell$'s fixed, random $W_\ell$ are near-optimal as $|B_\ell|$ goes to infinity.

Proof: By construction of our model, any bandwise projections taken from different bands are
mutually independent, which implies that the optimal set of projections for each band can be found
separately. Since each band is white, Observation 2 and the subsequent arguments apply here, which
gives us bandwise random projections. ■
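A bandwise random projection matrix of this block-diagonal form can be constructed as in the sketch below. It is an illustration under assumptions: the matrix acts on the independent-component coordinates $s$ (for $x$ one would apply it to $V^T x$), each block has random orthonormal rows obtained from a QR factorization, and the band sizes and per-band counts are hypothetical.

```python
import numpy as np

def bandwise_random_projection(band_sizes, p_per_band, rng):
    """Block-diagonal p x d matrix; block l has p_l random orthonormal rows over band l."""
    d, p = sum(band_sizes), sum(p_per_band)
    W = np.zeros((p, d))
    row, col = 0, 0
    for size, p_l in zip(band_sizes, p_per_band):
        if p_l > size:
            raise ValueError("cannot take more projections than the band dimension")
        if p_l > 0:
            q, _ = np.linalg.qr(rng.standard_normal((size, p_l)))  # orthonormal columns
            W[row:row + p_l, col:col + size] = q.T
        row += p_l
        col += size
    return W

rng = np.random.default_rng(0)
band_sizes = [1, 3, 12, 48]   # hypothetical band dimensions |B_l|
p_per_band = [1, 3, 6, 2]     # hypothetical numbers of projections p_l
W = bandwise_random_projection(band_sizes, p_per_band, rng)
print(W.shape)
```

Since the blocks occupy disjoint rows and columns and each block has orthonormal rows, the full matrix satisfies the orthonormality constraint $WW^T = I$.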

Now the only remaining job is to determine $p_\ell$. For each band $B_\ell$, the expected value of the entropy
behaves like (12) with $d$ substituted by $|B_\ell|$. To illustrate this, Fig. 7 shows the individual capacity $\nu(k)$ of
the bandwise random projections, where $\Delta_\ell$ and $\Delta_{\ell+1}$ denote the shape term improvement obtained by randomly
mixing the $s_k$'s of the same band, approximately $c_2 - c_\alpha$, and the slopes are $\frac{2\Delta_\ell}{|B_\ell| - 1}$ and $\frac{2\Delta_{\ell+1}}{|B_{\ell+1}| - 1}$,
respectively. As shown in the figure, the last few random projections on $B_\ell$ have little value in
comparison with the first few random projections on $B_{\ell+1}$. After evaluating $\nu(k)$ for all $k$, we can
simply select the projections associated with the $p$ largest values as the optimal choice.

Fig. 7. Individual capacity diagram for bandwise random projections.
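The selection rule can be sketched in code. The following is a hedged illustration, assuming that the individual capacity of the $k$th random projection in band $\ell$ is $c_2 + \frac{1}{2}\ln\lambda_\ell - \frac{2(k-1)}{|B_\ell|-1}(c_2 - c_\alpha)$ (the linear profile within a band plus the variance term of that band); the band sizes, the variances $4^{-\ell}$, and the function names are hypothetical.

```python
import math
import numpy as np

def gg_shape_term(alpha):
    """Unit-variance entropy (nats) of GG(alpha)."""
    log_ratio = (math.log(4.0) + 3.0 * math.lgamma(1.0 / alpha)
                 - 2.0 * math.log(alpha) - math.lgamma(3.0 / alpha))
    return 0.5 * log_ratio + 1.0 / alpha

def select_projections_per_band(band_sizes, band_vars, alpha, p):
    """Pick the p largest individual capacities and count how many fall in each band."""
    c2, ca = gg_shape_term(2.0), gg_shape_term(alpha)
    caps, owners = [], []
    for band, (size, var) in enumerate(zip(band_sizes, band_vars)):
        k = np.arange(1, size + 1)
        slope = 2.0 * (c2 - ca) / (size - 1) if size > 1 else 0.0
        caps.append(c2 + 0.5 * math.log(var) - slope * (k - 1))
        owners.append(np.full(size, band))
    caps, owners = np.concatenate(caps), np.concatenate(owners)
    top = np.argsort(-caps, kind="stable")[:p]  # indices of the p largest capacities
    return np.bincount(owners[top], minlength=len(band_sizes))

band_sizes = [4, 12, 48, 192]                 # hypothetical |B_l|
band_vars = [4.0 ** (-l) for l in range(4)]   # hypothetical variances, 1/4**l
counts_sparse = select_projections_per_band(band_sizes, band_vars, alpha=0.5, p=20)
counts_gauss = select_projections_per_band(band_sizes, band_vars, alpha=2.0, p=20)
print(counts_sparse, counts_gauss)
```

For a Gaussian source ($\alpha = 2$) the capacity profile is flat within each band, so the rule reduces to filling the highest-variance bands first, as PCA would. For a sparse source the tilt within each band lets the first projections of a higher-frequency band compete with the last projections of a lower-frequency band.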




B. Examples 

Considering 256 × 256 images ($d$ = 65,536), we computed the individual capacity of bandwise random
projections, shown in Fig. 8, for two cases, $\alpha = 0.32$ and $\alpha = 0.49$. For the optimal choice, we should arrange
$\nu(k)$ in non-increasing order and pick the first $p$ projections.

Note that the optimal set of projections varies according to the value of $\alpha$. For $\alpha = 0.49$, the lower
frequency bands are far more favored (in other words, truly random projections are avoided) than for
$\alpha = 0.32$, due to the relative "denseness" of the source distribution.

C. Noisy Measurement 

Next, we consider noisy measurements on the multi-resolution signal model. If we compute the pdf of
$x' = x + \eta_x$ in (19) by convolving $p(x)$ in (21) and $G(\eta_x; 0, \sigma^2 I)$ together, we obtain (see Appendix D)

$$p(x') = \prod_{\ell=0}^{L} \prod_{k \in B_\ell} \frac{1}{\sqrt{\lambda_\ell + \sigma^2}}\, \psi'_\ell\!\left( \frac{v_k^T x'}{\sqrt{\lambda_\ell + \sigma^2}} \right) \tag{24}$$

for some unit-variance pdfs $\psi'_\ell(\cdot)$. This implies that the independent components $\{v_k\}$ are preserved even
with the addition of the noise $\eta_x$, but the shape and variance of each projection $s'_k = v_k^T x'$ change from those
of $s_k$. We see in (24) that the variances increase uniformly by $\sigma^2$. The shape term of $s'_k$ becomes exactly
(20), only with $\lambda$ substituted by $\lambda_\ell$ when $k \in B_\ell$.

Fig. 8. Individual capacity diagrams of bandwise random projections (a) for $\alpha = 0.32$ and (b) for $\alpha = 0.49$.

The change in variances and shape terms seriously affects the individual capacity diagram. Fig. 9
shows the variation of the individual capacity diagram for the case $\alpha = 0.32$ for various noise levels.
As shown in the figure, the overall profile changes drastically even with a relatively small amount of
noise. In particular, the slopes in the high-frequency bands become flattened, which implies that the random
mixing cannot overcome the barrier (variance gap) between the bands, and as a result, low-frequency
components are favored. This provides yet another illustration of Observation 5.

V. Experiments: Informative Sensing of Natural Images 

In this section, we apply the UCA scheme (i.e., bandwise random projections), found in Section IV,
to natural images and make comparisons against other kinds of projections (e.g., PCA projection and
random projections) in terms of signal reconstruction performance. To implement the proposed UCA
scheme, we actually conduct the band decomposition in the discrete cosine transform (DCT) domain, as
illustrated in Fig. 10, instead of explicitly using Gabor-like filters. The DCT kernels are also known to
approximate the principal components of natural images well, and each kernel in $B_\ell$ represents some
harmonic (deterministic) mixing of the independent components that lie in the frequencies between

2£ f 2^ 
^ < 7- < ^ (25) 
Fig. 9. Individual capacity diagrams for the Cameraman image for various noise levels: (a) $\sigma = 0$ (no noise), (b) $\sigma = 5$, (c)
$\sigma = 10$, (d) $\sigma = 20$. Each pixel can have a value from 0 to 255. The noise level may be understood in comparison with the
maximum pixel value.



where $f = \sqrt{f_x^2 + f_y^2}$ and $f_s$ denotes the sampling frequency in both directions. Then, the band selection
in the DCT domain can effectively sift the independent components of the same resolution. In fact, the
independent components are over-complete,* but we proceed as if there were only a complete set of
independent components orthogonal to each other.† Then, random mixing of DCT coefficients on a
specific band is treated as equivalent to random mixing of the independent components in that band.

* In a specific band (resolution), each independent component corresponds to a local edge at a particular location and angle.
† From another viewpoint, we are approximating the smooth power spectrum falling off as $1/f^2$ by a bandwise-flat one.

Fig. 10. Illustration of band decomposition in DCT domain.
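As a rough illustration of random mixing within a DCT band, the sketch below builds an orthonormal 1-D DCT-II matrix and forms measurement rows as random sign mixtures of the kernels in one band. It is a simplification under stated assumptions: the experiments here use 2-D images and noiselets, whereas this sketch uses a 1-D signal and i.i.d. random signs as a stand-in, and the band indices, the helper names `dct_matrix` and `bandwise_random_rows`, and the sizes are all hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; row k is the k-th DCT kernel."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)  # rescale the DC row so that all rows have unit norm
    return d

def bandwise_random_rows(d, band, num_rows, rng):
    """Mix the DCT kernels indexed by `band` with random +/-1 weights.

    Random signs are an assumption standing in for noiselets.
    """
    signs = rng.choice([-1.0, 1.0], size=(num_rows, len(band)))
    return (signs / np.sqrt(len(band))) @ d[band, :]

rng = np.random.default_rng(0)
D = dct_matrix(16)
band = np.arange(4, 8)  # hypothetical band: DCT indices 4..7
W_band = bandwise_random_rows(D, band, num_rows=2, rng=rng)
```

Each row of `W_band` has unit norm and lies entirely in the span of the selected band, so it carries no information about coefficients outside that band.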

To carry out random mixing, we use a set of noiselets [25], a binary-valued random matrix,
for efficient computer simulation. For further ease of simulation, we make a slight
modification to the UCA scheme. Given the total number of projections $p$, we determine the number of
projections $p_\ell$ for each band $B_\ell$ using the capacity diagram. In practice, we take all $|B_\ell|$ projections
if $p_\ell > 0.9|B_\ell|$, and none if $p_\ell < 0.1|B_\ell|$, which removes the need for random mixing in
both cases. Then, we take the remaining number of random projections across all the other bands at
once.
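The thresholding rule above can be sketched as follows. This is only one reading of the rule: the function name `simplify_allocation`, the example band sizes, and the ideal per-band counts are hypothetical, and the ideal counts are assumed to have been read off the capacity diagram beforehand.

```python
def simplify_allocation(band_sizes, ideal_counts, hi=0.9, lo=0.1):
    """Apply the 0.9 / 0.1 thresholding rule to per-band projection counts.

    band_sizes[l]   : |B_l|, number of DCT coefficients in band l
    ideal_counts[l] : p_l, number of projections suggested for band l
    Returns the bands measured in full, the bands skipped, the bands that
    share the remaining random projections, and how many remain.
    """
    p_total = sum(ideal_counts)
    full, dropped, mixed = [], [], []
    for idx, (size, p_l) in enumerate(zip(band_sizes, ideal_counts)):
        if p_l > hi * size:
            full.append(idx)      # take all |B_l| coefficients directly
        elif p_l < lo * size:
            dropped.append(idx)   # take nothing from this band
        else:
            mixed.append(idx)     # needs random mixing
    used = sum(band_sizes[i] for i in full)
    n_random = max(0, p_total - used)  # spread jointly over the mixed bands
    return {"full": full, "dropped": dropped, "mixed": mixed, "n_random": n_random}

# Hypothetical example with four bands and p = 600 projections in total.
result = simplify_allocation([64, 192, 768, 3072], [64, 180, 300, 56])
```

In this example the two lowest bands are measured in full, the highest band is skipped, and the remaining 344 projections are random projections over the third band.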

The signal reconstruction experiments were built on Romberg's implementation [26].
To reconstruct an image from measurements, we minimized the total variation (TV) of x, subject to
y = Wx, defined by

$$\mathrm{TV}(X) = \sum_{i,j} \sqrt{(X_{i+1,j} - X_{i,j})^2 + (X_{i,j+1} - X_{i,j})^2}, \tag{26}$$

where X is the matrix representation of x. TV minimization is known to perform better than
$L_1$-norm minimization on the sparse basis (e.g. wavelets), avoiding high-frequency artifacts [26], [27].

The experimental results obtained for two 256 x 256 images, Cameraman and Einstein, are
shown in Fig. 11. We compared the performance, in terms of peak signal-to-noise ratio (PSNR), among
five different schemes: linear reconstruction using low-pass DCT coefficients in zig-zag order (blue),
nonlinear reconstruction using the same set of DCT coefficients (green), Romberg's method [26], which
takes the first 1,000 DCT coefficients, also in zig-zag order, and switches to random projections (red),
pure random projections (cyan), and bandwise random projections, which are the uncertain components
we found for natural images (magenta). Except for the first one, TV minimization was used throughout
to recover images from each set of measurements.
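For concreteness, the two quantities used in this comparison can be sketched as below. The sketch assumes the common isotropic discrete TV with forward differences and the last row and column dropped at the boundary, and a PSNR with peak value 255; the function names are hypothetical, and the boundary handling in Romberg's implementation may differ.

```python
import numpy as np

def total_variation(X):
    """Isotropic discrete TV of a 2-D array using forward differences."""
    X = np.asarray(X, dtype=float)
    dx = np.diff(X, axis=0)[:, :-1]  # X[i+1, j] - X[i, j]
    dy = np.diff(X, axis=1)[:-1, :]  # X[i, j+1] - X[i, j]
    return float(np.sum(np.sqrt(dx ** 2 + dy ** 2)))

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB for images with the given peak value."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(reconstruction, dtype=float)
    mse = np.mean((ref - rec) ** 2)
    return float(10.0 * np.log10(peak ** 2 / mse))
```

A constant image has zero TV, while sharp edges contribute in proportion to their height and length, which is why TV minimization tends to favor piecewise-smooth reconstructions.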

Unsurprisingly, in every case, the green curve (DCT with TV minimization) is above the blue (DCT
with linear reconstruction), and the red curve (1k DCT + random projections) is above the cyan (pure
random projections). However, if we look at the green (DCT with TV minimization) and the red (1k DCT
+ random projections), their relative performance changes completely, depending on the source image.
Indeed, the two images turn out to have quite different characteristics in terms of their sparsity. The GG
shape parameter was estimated, from their Haar wavelet coefficients, to be $\alpha \approx 0.32$ for the Cameraman
image and $\alpha \approx 0.49$ for the Einstein image, with their capacity diagrams corresponding to Fig. 8(a)
and 8(b), respectively.

The Cameraman image is quite sparse. A moderate number of random projections are capable of 
evenly grabbing image contents from all spatial frequencies. Meanwhile, the DCT projection uses up all 
available sensors only for the low-frequency contents, which could be captured with even fewer sensors, 
while missing nearly all high-frequency details. On the other hand, the Einstein image is not that sparse. 
As suggested by Fig. 8(b), we must deploy almost all sensors for low-frequency bands. Otherwise, even
a low-resolution version of the image cannot be recovered faithfully. Indeed, the DCT projection proves
almost optimal for the Einstein image. 

The UCA projections (magenta) outperform Romberg's method (red) as well as the DCT projections
(green). In principle, the UCA projections are expected to perform at least as well as the better of the two.
On occasion, however, for the Einstein image, the DCT projection was marginally better than the UCA
projection. This is because the DCT projection is nearly optimal for the Einstein image and the UCA
projection, based on (21), may suffer from artifacts of inaccurate modeling.

Fig. 12 shows the reconstruction results from 5,000 measurements (7.6% of the original dimension) on
the Cameraman image, which portray the behavioural characteristics of each scheme. First, the image
linearly reconstructed from DCT projections is blurred and exhibits ringing. Such artifacts can be removed
by employing a nonlinear reconstruction (i.e. TV minimization). However, the measurements are still
concentrated on low frequencies, so the image loses most of its significant mid/high-frequency contents. In
contrast, Romberg's method pursues high-frequency contents too hastily despite the seriously limited
number of measurements. Indeed, it somewhat succeeds in recovering high-frequency details, but at
considerable sacrifice of the low/mid-frequency contents, which are more important. Last, the UCA projection
gives up the high-frequency contents but preserves low/mid-frequency contents quite faithfully.

We also ran similar experiments in noisy settings. In this case, we appended a denoising module
based on the field-of-experts image prior model [24] after the TV minimization, for all nonlinear
reconstruction schemes. Fig. 13 compares the performance of the five compressed sensing schemes for p = 21,000
at various noise levels. Note that, for the Cameraman image, having started worst among the nonlinearly
recovered schemes, the DCT projection (green) catches up and even exceeds all the others as $\sigma$ increases,
while Romberg's method (red) as well as random projections (cyan) degrade fast. For the Einstein
image, the DCT projection is persistently better than Romberg's method and random projections, and
more remarkably, its degradation is the slowest. The UCA projection (magenta) finds the best set of
projections over most of the range of noise levels, converging to the DCT projection as $\sigma$ increases. In
the low SNR regime, the UCA projection worked slightly worse than the DCT projection, perhaps again
due to artifacts of inaccurate modeling.

Note that the UCA scheme uses different sets of projections depending on the sparsity of the source
image and also on the noise level. If the sparsity of the source image is unknown, we might
have to use a value learnt in advance from a large collection of natural images. Then, we can achieve
near-optimal performance overall, but not in every individual case. In certain applications, it
may be possible to sense a few hundred Haar wavelet coefficients so that we can estimate the sparsity
before we take tens of thousands of projections.
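One possible way to estimate the sparsity from a modest number of coefficients is moment matching for the generalized Gaussian (GG) shape parameter. The sketch below is an assumption, not necessarily the estimator used for the experiments above: it matches the ratio $E|x| / \sqrt{E[x^2]}$, which for a zero-mean GG with shape $\alpha$ equals $\Gamma(2/\alpha) / \sqrt{\Gamma(1/\alpha)\Gamma(3/\alpha)}$, by bisection. The function names and the search interval are hypothetical, and computing the Haar coefficients themselves is left out.

```python
import math
import numpy as np

def gg_moment_ratio(alpha):
    """E|x| / sqrt(E[x^2]) for a zero-mean generalized Gaussian with shape alpha."""
    log_ratio = math.lgamma(2.0 / alpha) - 0.5 * (
        math.lgamma(1.0 / alpha) + math.lgamma(3.0 / alpha)
    )
    return math.exp(log_ratio)

def estimate_gg_shape(coeffs, lo=0.1, hi=5.0, iters=60):
    """Estimate the GG shape parameter by matching the moment ratio.

    The ratio grows with alpha on the search interval, so bisection applies.
    """
    c = np.asarray(coeffs, dtype=float)
    target = np.mean(np.abs(c)) / np.sqrt(np.mean(c ** 2))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gg_moment_ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For very sparse data (small $\alpha$) the sample moments are dominated by a few large coefficients, so an estimate from only a few hundred coefficients should be treated as coarse.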

VI. Conclusion 

Suppose we are allowed to take a small number of linear projections of signals in a dataset, and 
then use the projections plus our knowledge of the dataset to reconstruct the signals. What are the best 
projections to use? We have shown that these projections are not necessarily the principal components 
nor the independent components of the data nor random projections, but rather a new set of projections 
which we call uncertain components. We formalized this notion, informative sensing, by maximizing the 
mutual information between the signals and their projections. 

Then, we presented some analytical results that help us understand the desirable behavior of
informative sensing. For white data, the most informative projections are those that are as Gaussian
as possible, which favors random projections, while PCA or ICA can significantly outperform random
projections for highly non-white data or in the low SNR regime. In particular, for natural images, we showed
that more sensors should be reserved for low-frequency contents than for high-frequency contents,
but not too many, which makes bandwise random projections most informative.

Appendix 

A. Derivation of h(Wx) for the Objective Function in Noiseless Settings

Consider the subspace $\mathcal{W}_\perp$, not spanned by the row vectors of $W$, which represents the unmeasured
dimensions of $x$, and define $W_\perp$ so that its row vectors form an orthonormal basis for $\mathcal{W}_\perp$. If we define

$$\begin{bmatrix} y \\ z \end{bmatrix} \stackrel{\text{def}}{=} \begin{bmatrix} Wx \\ W_\perp x \end{bmatrix} = \begin{bmatrix} W \\ W_\perp \end{bmatrix} x = Ux, \tag{27}$$

the pair $(y, z)$ corresponds to $x$ in a rotated basis because of the unitarity of $U$. Since we measure $y$
exactly, i.e., some partial coordinates of $x$, the remaining job is to infer $z$ using $y$ at hand. In doing so,
to reduce the uncertainty about $z$ as much as possible, we seek to minimize $h(z|y)$, and from the fact
that $h(x) = h(Ux) = h(y, z) = h(y) + h(z|y)$ and $h(x)$ is fixed, minimizing $h(z|y)$ is equivalent to
maximizing $h(y)$ in the noiseless condition.
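The entropy decomposition used in this argument can be checked numerically in the special case of a Gaussian signal, where all entropies have closed forms. The sketch below is only a sanity check under that assumption (the argument itself does not require Gaussianity); the dimensions, the random covariance, and the helper name `gaussian_entropy` are hypothetical.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a zero-mean Gaussian with covariance cov."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

rng = np.random.default_rng(0)
d, p = 6, 2
A = rng.standard_normal((d, d))
cov_x = A @ A.T + np.eye(d)               # a generic non-white covariance

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
U = Q.T                                    # orthogonal matrix, rows orthonormal
W, W_perp = U[:p], U[p:]                   # measured and unmeasured directions

cov_y = W @ cov_x @ W.T
cov_z = W_perp @ cov_x @ W_perp.T
cov_zy = W_perp @ cov_x @ W.T
# Conditional covariance of z given y (Schur complement).
cov_z_given_y = cov_z - cov_zy @ np.linalg.solve(cov_y, cov_zy.T)

h_x = gaussian_entropy(cov_x)
h_y = gaussian_entropy(cov_y)
h_z_given_y = gaussian_entropy(cov_z_given_y)
```

Because `h_x` does not depend on the choice of `U`, any choice of `W` that raises `h_y` lowers `h_z_given_y` by the same amount.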

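B. Subvolume Expectation

Let $\Lambda_m = \{\lambda_1, \lambda_2, \ldots, \lambda_m\}$ and define $S_p(\Lambda_m)$ as the sum of all the products made up of $p$ elements
in $\Lambda_m$. Then,

$$E[\mathrm{Vol}_p(\lambda_1, \ldots, \lambda_d)] = \frac{S_p(\Lambda_d)}{\binom{d}{p}}. \tag{28}$$

Note that $S_p(\Lambda_m)$ can be specified recursively by

$$S_p(\Lambda_m) = \sum_{j=p}^{m} S_{p-1}(\Lambda_{j-1})\,\lambda_j, \qquad p, m = 1, 2, \ldots, d, \tag{29}$$

with $S_0(\cdot) = 1$, which enables us to efficiently compute $E[\mathrm{Vol}_p(\lambda_1, \ldots, \lambda_d)]$ by dynamic programming.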
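A minimal sketch of the dynamic program follows. It uses the one-step form of the recursion, in which adding an element $\lambda_m$ to the set increases the sum of $p$-element products by $\lambda_m$ times the previous sum of $(p-1)$-element products. The function name `subset_product_sums` and the example values are hypothetical.

```python
from itertools import combinations
from math import prod

def subset_product_sums(lams, p_max):
    """Return S where S[p] is the sum of products over all p-element subsets.

    S[0] = 1 by convention. Runs in O(len(lams) * p_max) time.
    """
    S = [1.0] + [0.0] * p_max
    for lam in lams:
        # Go downwards so S[p - 1] still refers to the set without `lam`.
        for p in range(p_max, 0, -1):
            S[p] += S[p - 1] * lam
    return S

def brute_force_sum(lams, p):
    """Direct enumeration, exponential in general; for checking only."""
    return sum(prod(c) for c in combinations(lams, p))

S = subset_product_sums([1.0, 2.0, 3.0, 4.0], p_max=4)
```

For the four example values the result is `[1, 10, 35, 50, 24]`, which agrees with direct enumeration while avoiding the combinatorial number of subsets.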



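C. Derivation of (20)

Because $x' = x + \eta_x = \sqrt{\lambda}\,\tilde{x} + \sigma\tilde{\eta}$, where $\tilde{x} \sim GG(\alpha, 0, 1)$ and $\tilde{\eta} \sim G(0, 1)$, the variance of $x'$ is
$\lambda + \sigma^2$, and by definition of the shape term, it is

$$h\!\left(\frac{\sqrt{\lambda}\,\tilde{x} + \sigma\tilde{\eta}}{\sqrt{\lambda + \sigma^2}}\right), \tag{30}$$

which can be simply arranged into (20).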
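The variance statement in this derivation can be checked by simulation. The sketch below draws unit-variance generalized Gaussian samples through a gamma transform (if $|x|^\alpha$ is gamma distributed with shape $1/\alpha$, then $x$ has density proportional to $\exp(-|x|^\alpha)$), adds Gaussian noise, and estimates the variance of the noisy component. The function names and the parameter values are hypothetical.

```python
import math
import numpy as np

def sample_unit_gg(alpha, size, rng):
    """Zero-mean, unit-variance generalized Gaussian samples with shape alpha."""
    g = rng.gamma(shape=1.0 / alpha, scale=1.0, size=size)
    x = rng.choice([-1.0, 1.0], size=size) * g ** (1.0 / alpha)
    # Variance of the unnormalized variable is Gamma(3/alpha) / Gamma(1/alpha).
    std = math.sqrt(math.gamma(3.0 / alpha) / math.gamma(1.0 / alpha))
    return x / std

def noisy_component_variance(alpha, lam, sigma, n, rng):
    """Monte Carlo variance of x' = sqrt(lam) * x_tilde + sigma * eta_tilde."""
    x_tilde = sample_unit_gg(alpha, n, rng)
    eta_tilde = rng.standard_normal(n)
    x_noisy = math.sqrt(lam) * x_tilde + sigma * eta_tilde
    return float(np.var(x_noisy))

rng = np.random.default_rng(0)
var_est = noisy_component_variance(alpha=1.0, lam=4.0, sigma=1.0, n=400000, rng=rng)
```

With these values the estimate should be close to $\lambda + \sigma^2 = 5$. For small shape parameters such as $\alpha = 0.32$ the tails are much heavier, so far more samples are needed for a stable estimate.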
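D. Derivation of (24)

The pdf of $x' = x + \eta_x$ can be computed by convolving $p(x)$ and $p(\eta_x)$ because $\eta_x$ is independent of
$x$. Changing the variable $x$ by $s = V^T x$, where $V = [v_1 \ldots v_d]$, and exploiting $p(\eta_x) = G(\eta_x; 0, \sigma^2 I) = \prod_k G(v_k^T \eta_x; 0, \sigma^2)$,

$$\begin{aligned}
p(x') &= \int p(x)\, G(x' - x; 0, \sigma^2 I)\, dx \\
&= \int p(Vs)\, G(x' - Vs; 0, \sigma^2 I)\, ds \\
&= \int \prod_{\ell} \prod_{k \in B_\ell} \frac{1}{\sqrt{\lambda_\ell}}\, \phi\!\left(\frac{s_k}{\sqrt{\lambda_\ell}}\right) \prod_{\ell,k} G(v_k^T x' - s_k; 0, \sigma^2)\, ds \\
&= \prod_{\ell,k} \int \frac{1}{\sqrt{\lambda_\ell}}\, \phi\!\left(\frac{s_k}{\sqrt{\lambda_\ell}}\right) G(v_k^T x' - s_k; 0, \sigma^2)\, ds_k \\
&= \prod_{\ell,k} \phi'_\ell(v_k^T x'),
\end{aligned} \tag{31}$$

where

$$\phi'_\ell(t) = \int \frac{1}{\sqrt{\lambda_\ell}}\, \phi\!\left(\frac{u}{\sqrt{\lambda_\ell}}\right) G(t - u; 0, \sigma^2)\, du. \tag{32}$$

The variance $(\sigma'_k)^2$ of $s'_k = v_k^T x'$ can be easily calculated by

$$(\sigma'_k)^2 = \mathrm{Var}\!\left(v_k^T x + v_k^T \eta_x\right) = \lambda_\ell + \sigma^2, \qquad \forall k \in B_\ell. \tag{33}$$

Finally, defining $\psi'_\ell(\tau)$ as $\psi'_\ell(\tau) = \sqrt{\lambda_\ell + \sigma^2}\, \phi'_\ell\!\left(\sqrt{\lambda_\ell + \sigma^2}\, \tau\right)$, we obtain (24).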



References 

[1] D. L. Donoho, "Compressed sensing," IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289-1306, Apr. 2006.
[2] E. J. Candès and T. Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?" IEEE Trans. Inf. Theory, vol. 52, no. 12, pp. 5406-5425, Dec. 2006.
[3] M. F. Duarte, M. B. Wakin, D. Baron, and R. G. Baraniuk, "Universal distributed sensing via random projections," in Proc. of International Conference on Information Processing in Sensor Networks, Nashville, TN, Apr. 2006, pp. 177-185.
[4] J. Haupt and R. Nowak, "Signal reconstruction from noisy random projections," IEEE Trans. Inf. Theory, vol. 52, no. 9, pp. 4036-4048, Sept. 2006.
[5] M. J. Wainwright, "Sharp thresholds for high-dimensional and noisy recovery of sparsity," in Proc. Allerton Conference on Communication, Control and Computing, 2006.
[6] R. M. Castro, J. Haupt, R. Nowak, and G. M. Raz, "Finding needles in noisy haystacks," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Mar. 2008, pp. 5133-5136.
[7] M. Duarte, M. Davenport, D. Takhar, J. Laska, T. Sun, K. Kelly, and R. Baraniuk, "Single-pixel imaging via compressive sampling," IEEE Signal Process. Mag., vol. 25, no. 2, pp. 83-91, Mar. 2008.



January 27, 2009 



DRAFT 



23 



[8] M. Lustig, J. M. Santos, D. Donoho, and J. M. Pauly, "k-t sparse: High frame rate dynamic MRI exploiting spatio-temporal sparsity," in Proc. Annual Meeting of ISMRM, 2006.
[9] M. W. Seeger, H. Nickisch, R. Pohmann, and B. Schölkopf, "Bayesian experimental design of magnetic resonance imaging sequences," in Advances in Neural Information Processing Systems, 2008.
[10] T. Lin and F. J. Herrmann, "Compressed wavefield extrapolation," Geophysics, vol. 72, no. 5, pp. SM77-SM93, Sept./Oct. 2007.
[11] A. J. Bell and T. J. Sejnowski, "Edges are the independent components of natural scenes," in Advances in Neural Information Processing Systems, vol. 9, 1997, pp. 831-837.
[12] R. Linsker, "An application of the principle of maximum information preservation to linear systems," in Advances in Neural Information Processing Systems, vol. 1, 1989, pp. 186-194.
[13] F. Attneave, "Informational aspects of visual perception," Psych. Rev., vol. 61, pp. 183-193, 1954.
[14] H. B. Barlow, "Possible principles underlying the transformation of sensory messages," in Sensory Communications, W. A. Rosenblith, Ed. Cambridge, MA: MIT Press, 1961, pp. 217-234.
[15] J. J. Atick, "Could information theory provide an ecological theory of sensory processing?" Network: Comput. Neural Syst., vol. 3, pp. 213-251, 1992.
[16] Y. Weiss, H. S. Chang, and W. T. Freeman, "Learning compressed sensing," in Proc. Allerton Conf. on Communication, Control, and Computing, Sept. 2007.
[17] M. W. Seeger and H. Nickisch, "Compressed sensing and Bayesian experimental design," in Proc. Int. Conf. on Machine Learning, June 2008, pp. 912-919.
[18] S. Dasgupta, D. Hsu, and N. Verma, "A concentration theorem for projections," in Proc. of 22nd Conf. on Uncertainty in Artificial Intelligence, July 2006.
[19] S. G. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. Pattern Anal. Machine Intell., vol. 11, no. 7, pp. 674-693, July 1989.
[20] M. Bethge, "Factorial coding of natural images: How effective are linear models in removing higher-order dependencies?" J. Opt. Soc. Am. A, vol. 23, no. 6, pp. 1253-1268, June 2006.
[21] A. van der Schaaf and J. van Hateren, "Modelling the power spectra of natural images: statistics and information," Vision Research, vol. 36, no. 17, pp. 2759-2770, 1996.
[22] B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, pp. 607-609, June 1996.
[23] E. P. Simoncelli, "Statistical models for images: Compression, restoration and synthesis," in Proc. Asilomar Conf. on Signals, Systems and Computers, Nov. 1997, pp. 673-678.
[24] Y. Weiss and W. T. Freeman, "What makes a good model of natural images?" in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Minneapolis, MN, June 2007.
[25] E. J. Candès and J. Romberg, "Sparsity and incoherence in compressive sampling," Inverse Prob., vol. 23, no. 3, pp. 969-986, June 2007.
[26] J. Romberg, "Imaging via compressive sampling," IEEE Signal Process. Mag., vol. 25, no. 2, pp. 14-20, Mar. 2008.
[27] R. Berinde and P. Indyk, "Sparse recovery using sparse random matrices," MIT, Tech. Rep., 2008.






Fig. 11. Experimental results on two images: (a) Cameraman and (b) Einstein. Compared schemes are DCT with linear
reconstruction (blue), DCT with TV minimization (green), 1k DCT + random (red), pure random (cyan), and UCA projection
(magenta).

Fig. 13. Experimental results, in noisy conditions, on two images: (a) Cameraman and (b) Einstein. Compared schemes are
DCT with linear reconstruction (blue), DCT with TV minimization (green), 1k DCT + random (red), pure random (cyan), and
UCA projection (magenta). The number of measurements is set to p = 21,000.